Left join could use bitmap for left join instead of Vec<bool> by boazberman · Pull Request #1291 · apache/datafusion

boazberman · 2021-11-13T14:29:00Z

Which issue does this PR close?

Closes #240.
This is a new (Arrow-native) version of #884, which went stale because I didn't have the time to finish it until now.

Rationale for this change

Described in the issue.

What changes are included in this PR?

Described in the issue.

Are there any user-facing changes?

Dandandan · 2021-11-14T14:10:00Z

+    let indices =
+        UInt64Array::from_iter_values((0..visited_left_side.len()).filter_map(|v| {
+            (unmatched ^ visited_left_side.get_bit(v)).then(|| v as u64)
+        }));


Are we sure this isn't a degradation in performance by doing the check per item?

You mean the xor? I'm not sure how big of an impact it might have, but I can remove it if you believe it is worth it 🤷‍♂️

It would be good to create a simple hash join micro benchmark in datafusion/benches to validate this.

I will try to find the time to write such a benchmark, but could I do it in a separate PR? Are we really so unsure that this PR will improve performance (even if only by a bit)?

Can't speak for @Dandandan. At least for me the memory saving is obvious 👍 But it's unclear to me how much overhead unmatched ^ visited_left_side.get_bit(v) adds. If we can look into the generated machine code and compare, then that would be another way to prove whether it is more efficient than what we have right now.
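To make the memory argument concrete, here is a minimal sketch of the difference in footprint. The `Bitmap` type below is hypothetical and written only for illustration; it is not the Arrow buffer the PR uses, though it shares the one-bit-per-row idea.

```rust
/// Hypothetical minimal bitmap, for illustration only
/// (not the Arrow type used in the PR).
struct Bitmap {
    bytes: Vec<u8>,
    len: usize,
}

impl Bitmap {
    /// Allocates one bit per row, rounded up to whole bytes.
    fn new(len: usize) -> Self {
        Bitmap { bytes: vec![0u8; (len + 7) / 8], len }
    }

    /// Marks row `i` as visited.
    fn set_bit(&mut self, i: usize) {
        assert!(i < self.len);
        self.bytes[i / 8] |= 1 << (i % 8);
    }

    /// Returns whether row `i` has been visited.
    fn get_bit(&self, i: usize) -> bool {
        assert!(i < self.len);
        self.bytes[i / 8] & (1 << (i % 8)) != 0
    }

    /// Number of bytes backing the bitmap.
    fn byte_size(&self) -> usize {
        self.bytes.len()
    }
}

fn main() {
    let rows = 1_000_000;

    // Vec<bool> stores one byte per element.
    let flags = vec![false; rows];
    let vec_bytes = flags.len() * std::mem::size_of::<bool>();

    // The bitmap stores one bit per element.
    let mut bitmap = Bitmap::new(rows);
    bitmap.set_bit(42);
    assert!(bitmap.get_bit(42));

    println!("Vec<bool>: {} bytes", vec_bytes);
    println!("bitmap:    {} bytes", bitmap.byte_size());
}
```

For one million left-side rows this is 1,000,000 bytes against 125,000 bytes, an 8x reduction. The sketch says nothing about CPU cost, which is the open question in this thread.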

I ran the performance tests against this branch (merged to master) and master:

For q13 (which has a left join): at TPCH Scale Factor (SF) 10 (aka around 10GB of data), on a GCP 16core 64GB RAM machine, running Ubuntu 20.04:

My results show no measurable difference.

Converted to Parquet via:

cargo run --release --bin tpch -- convert --input ./data --output ./tpch-parquet --format parquet

Then benchmarked using:

cargo run --release --bin tpch -- benchmark datafusion --mem-table --format parquet --path ./tpch-parquet --query 13

On master:

Query 13 iteration 0 took 10017.1 ms
Query 13 iteration 1 took 10638.8 ms
Query 13 iteration 2 took 10010.3 ms
Query 13 avg time: 10222.08 ms

On this branch merged to master:

git checkout boazberman/master
git merge origin/master

Query 13 iteration 0 took 10438.6 ms
Query 13 iteration 1 took 10409.1 ms
Query 13 iteration 2 took 10030.8 ms
Query 13 avg time: 10292.82 ms

When I ran the same test again a few times, it reported avg times with significant deviation:

...
Query 13 avg time: 10750.95 ms
...
Query 13 avg time: 10325.13 ms
...
Query 13 avg time: 10460.80 ms

This leads me to conclude the very small reported difference is noise.
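As a rough check of that conclusion, the sketch below redoes the arithmetic on the timings quoted above. The helper names `mean` and `spread` are made up for this example, and comparing the gap between the two averages with the min-to-max spread of the rerun averages is only an informal heuristic, not a statistical test.

```rust
/// Arithmetic mean of a non-empty slice.
fn mean(xs: &[f64]) -> f64 {
    xs.iter().sum::<f64>() / xs.len() as f64
}

/// Difference between the largest and smallest value.
fn spread(xs: &[f64]) -> f64 {
    let max = xs.iter().cloned().fold(f64::NEG_INFINITY, f64::max);
    let min = xs.iter().cloned().fold(f64::INFINITY, f64::min);
    max - min
}

fn main() {
    // Per-iteration timings (ms) copied from the comment above.
    let master = [10017.1, 10638.8, 10010.3];
    let branch = [10438.6, 10409.1, 10030.8];
    // Average times (ms) reported by the repeated runs.
    let reruns = [10750.95, 10325.13, 10460.80];

    let gap = mean(&branch) - mean(&master);
    println!("branch - master gap: {:.2} ms", gap);
    println!("spread across reruns: {:.2} ms", spread(&reruns));

    // The gap is several times smaller than the run-to-run spread.
    assert!(gap.abs() < spread(&reruns));
}
```

The gap between the two averages is about 71 ms (under 1%), while the rerun averages alone vary by about 426 ms.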

Note that Q13 is:

select
    c_count,
    count(*) as custdist
from (
    select
        c_custkey,
        count(o_orderkey)
    from
        customer
        left outer join orders
            on c_custkey = o_custkey
            and o_comment not like '%special%requests%'
    group by
        c_custkey
) as c_orders (c_custkey, c_count)
group by
    c_count
order by
    custdist desc,
    c_count desc;

Maybe to remove all doubt we could skip the check on each iteration. Something like (untested):

let indices = if unmatched {
    UInt64Array::from_iter_values(
        (0..visited_left_side.len())
            .filter_map(|v| (!visited_left_side.get_bit(v)).then(|| v as u64)),
    )
} else {
    UInt64Array::from_iter_values(
        (0..visited_left_side.len())
            .filter_map(|v| visited_left_side.get_bit(v).then(|| v as u64)),
    )
};
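The two forms select the same rows, which a small stand-alone sketch can confirm. It assumes a plain `&[bool]` in place of the Arrow bitmap and collects into a `Vec<u64>` instead of a `UInt64Array`; the function names `indices_xor` and `indices_hoisted` are invented for this example.

```rust
/// Per-item XOR, as in the original diff: the flag is combined
/// with every bit inside the loop.
fn indices_xor(visited: &[bool], unmatched: bool) -> Vec<u64> {
    (0..visited.len())
        .filter_map(|v| (unmatched ^ visited[v]).then(|| v as u64))
        .collect()
}

/// The flag is checked once, outside the loop, as in the suggestion.
fn indices_hoisted(visited: &[bool], unmatched: bool) -> Vec<u64> {
    if unmatched {
        (0..visited.len())
            .filter_map(|v| (!visited[v]).then(|| v as u64))
            .collect()
    } else {
        (0..visited.len())
            .filter_map(|v| visited[v].then(|| v as u64))
            .collect()
    }
}

fn main() {
    // Rows 0 and 3 were matched during the probe phase.
    let visited = [true, false, false, true, false];

    for unmatched in [true, false] {
        assert_eq!(
            indices_xor(&visited, unmatched),
            indices_hoisted(&visited, unmatched)
        );
    }

    // Unmatched left rows: prints [1, 2, 4]
    println!("{:?}", indices_hoisted(&visited, true));
}
```

Since `true ^ b` equals `!b` and `false ^ b` equals `b`, hoisting the flag changes only where the decision is made, not which indices are produced.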

Thanks @alamb for running the benchmark. I agree with your simplification as well 👍

@boazberman are you willing to make the proposed change? I am also happy to do so

@alamb I made the proposed changes. Thanks.

houqp

LGTM, thank you @boazberman

alamb · 2021-12-21T14:16:50Z

Thanks for sticking with this @boazberman ❤️ I know it was a long trip

boaz-codota · 2021-12-21T15:06:28Z

Yay, at last 🚀

Dandandan · 2021-12-21T18:30:44Z

Thank you @boaz-codota

github-actions Bot added the datafusion label Nov 13, 2021

houqp added the performance Make DataFusion faster label Nov 14, 2021

Dandandan reviewed Nov 14, 2021

alamb mentioned this pull request Dec 7, 2021

cargo run --bin tpch fails with thread 'main' panicked at 'Argument short must be unique -p is already in use' #1409

Closed

houqp approved these changes Dec 8, 2021

boazberman added 5 commits December 18, 2021 14:28

Left join could use bitmap for left join instead of Vec<bool>

a0371dc

fix

744f8b5

Fix

b57c923

Finish implementation

fb19ea1

Update hash_join.rs

a432405

boazberman force-pushed the master branch from 3fbb789 to a432405 Compare December 18, 2021 12:55

alamb merged commit 5ef42eb into apache:master Dec 21, 2021

5 participants